{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introductory example to ensemble models\n",
    "\n",
    "This first notebook aims at emphasizing the benefit of ensemble methods over\n",
    "simple models (e.g. decision tree, linear model, etc.). Combining simple\n",
    "models result in more powerful and robust models with less hassle.\n",
    "\n",
    "We will start by loading the california housing dataset. We recall that the\n",
    "goal in this dataset is to predict the median house value in some district in\n",
    "California based on demographic and geographic data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "\n",
    "data, target = fetch_california_housing(as_frame=True, return_X_y=True)\n",
    "target *= 100  # rescale the target in k$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will check the generalization performance of decision tree regressor with\n",
    "default parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "tree = DecisionTreeRegressor(random_state=0)\n",
    "cv_results = cross_validate(tree, data, target, n_jobs=2)\n",
    "scores = cv_results[\"test_score\"]\n",
    "\n",
    "print(\n",
    "    \"R2 score obtained by cross-validation: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We obtain fair results. However, as we previously presented in the \"tree in\n",
    "depth\" notebook, this model needs to be tuned to overcome over- or\n",
    "under-fitting. Indeed, the default parameters will not necessarily lead to an\n",
    "optimal decision tree. Instead of using the default value, we should search\n",
    "via cross-validation the optimal value of the important parameters such as\n",
    "`max_depth`, `min_samples_split`, or `min_samples_leaf`.\n",
    "\n",
    "We recall that we need to tune these parameters, as decision trees tend to\n",
    "overfit the training data if we grow deep trees, but there are no rules on\n",
    "what each parameter should be set to. Thus, not making a search could lead us\n",
    "to have an underfitted or overfitted model.\n",
    "\n",
    "Now, we make a grid-search to tune the hyperparameters that we mentioned\n",
    "earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "param_grid = {\n",
    "    \"max_depth\": [5, 8, None],\n",
    "    \"min_samples_split\": [2, 10, 30, 50],\n",
    "    \"min_samples_leaf\": [0.01, 0.05, 0.1, 1],\n",
    "}\n",
    "cv = 3\n",
    "\n",
    "tree = GridSearchCV(\n",
    "    DecisionTreeRegressor(random_state=0),\n",
    "    param_grid=param_grid,\n",
    "    cv=cv,\n",
    "    n_jobs=2,\n",
    ")\n",
    "cv_results = cross_validate(\n",
    "    tree, data, target, n_jobs=2, return_estimator=True\n",
    ")\n",
    "scores = cv_results[\"test_score\"]\n",
    "\n",
    "print(\n",
    "    \"R2 score obtained by cross-validation: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that optimizing the hyperparameters will have a positive effect on the\n",
    "generalization performance. However, it comes with a higher computational\n",
    "cost."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a dataframe storing the important information collected during\n",
    "the tuning of the parameters and investigate the results.\n",
    "\n",
    "Now we will use an ensemble method called bagging. More details about this\n",
    "method will be discussed in the next section. In short, this method will use a\n",
    "base regressor (i.e. decision tree regressors) and will train several of them\n",
    "on a slightly modified version of the training set. Then, the predictions of\n",
    "all these base regressors will be combined by averaging.\n",
    "\n",
    "Here, we will use 20 decision trees and check the fitting time as well as the\n",
    "generalization performance on the left-out testing data. It is important to\n",
    "note that we are not going to tune any parameter of the decision tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "from sklearn.ensemble import BaggingRegressor\n",
    "\n",
    "estimator = DecisionTreeRegressor(random_state=0)\n",
    "bagging_regressor = BaggingRegressor(\n",
    "    estimator=estimator, n_estimators=20, random_state=0\n",
    ")\n",
    "\n",
    "cv_results = cross_validate(bagging_regressor, data, target, n_jobs=2)\n",
    "scores = cv_results[\"test_score\"]\n",
    "\n",
    "print(\n",
    "    \"R2 score obtained by cross-validation: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Without searching for optimal hyperparameters, the overall generalization\n",
    "performance of the bagging regressor is better than a single decision tree. In\n",
    "addition, the computational cost is reduced in comparison of seeking for the\n",
    "optimal hyperparameters.\n",
    "\n",
    "This shows the motivation behind the use of an ensemble learner: it gives a\n",
    "relatively good baseline with decent generalization performance without any\n",
    "parameter tuning.\n",
    "\n",
    "Now, we will discuss in detail two ensemble families: bagging and boosting:\n",
    "\n",
    "* ensemble using bootstrap (e.g. bagging and random-forest);\n",
    "* ensemble using boosting (e.g. adaptive boosting and gradient-boosting\n",
    "  decision tree)."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}